`generate_reduce_scaler` hardcoded 2048 bytes and 4 faces, assuming full 32x32 bf16 tiles. When circular buffers use half tiles (1024B, 2 faces), this overwrites adjacent L1 memory, causing watcher-detected corruption.

Restore the `half_tile` template parameter (previously removed in cleanup) so the zero-fill size and face iteration adapt to the actual tile dimensions. Also fix the idle-core runtime args count mismatch in `sdpa_decode_program_factory`.

Fixes: #37631
Fixes: #29225
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The watcher skip for issue #37631 was prematurely removed. Restore it until the underlying issue is fully resolved.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/codeowners ping

CodeOwners Group Analysis
This PR requires approval from one member of each of the following groups:
Summary: 2 pending groups, 0 approved groups
Group Information:
Note: At least one approval from each group is sufficient.

Hi Evan Smal (@esmalTT), Raymond Kim (@tt-rkim), this PR "Fix SDPA TT_METAL_WATCHER issues" by Pavle Josipović (@pavlejosipovic) needs your approval/review to merge.
Pull request overview
This PR fixes TT_METAL_WATCHER-detected corruption/issues in SDPA decode by ensuring the reduce-scaler generation logic matches the actual circular buffer tile size (full vs half tiles) and by aligning idle-core runtime argument counts with what the decode reader kernel expects. It also updates SDPA prefill unit tests to remove watcher skips now that the underlying issue is addressed.
Changes:
- Restore half-tile awareness for `generate_reduce_scaler` and pass the correct half/full-tile mode from the SDPA decode writer kernel.
- Fix idle-core reader runtime-arg vector length in `SdpaDecodeProgramFactory` to match the reader kernel's expected arg reads.
- Remove `TT_METAL_WATCHER` skip decorators from `test_sdpa_prefill.py` (per PR description: now passing with the fix).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/sdpa_decode_program_factory.cpp` | Fix idle-core reader runtime-arg count to prevent watcher OOB runtime-arg access. |
| `ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/kernels/dataflow/writer_decode_all.cpp` | Detect half-tile scalar CBs and invoke `generate_reduce_scaler` with the correct template mode. |
| `ttnn/cpp/ttnn/kernel/dataflow/generate_reduce_scaler.hpp` | Reintroduce `half_tile` template parameter to size the zero-fill and face-looping correctly. |
| `tests/ttnn/unit_tests/operations/sdpa/test_sdpa_prefill.py` | Remove watcher-enabled skips now that corruption/OOM issues should be resolved by the kernel fix. |
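
The runtime-arg mismatch in the first row can be pictured with a small host-side sketch. It is not the code from `sdpa_decode_program_factory.cpp`: the constant `kReaderRuntimeArgCount`, its value of 6, and all three helper names are hypothetical, chosen only to show why idle cores need an argument vector as long as the one the reader kernel indexes into.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical: the number of runtime args the reader kernel reads.
// The real count is not stated in this PR text; 6 is a placeholder.
constexpr std::size_t kReaderRuntimeArgCount = 6;

// Hypothetical builder for an active core: one value per arg slot.
std::vector<std::uint32_t> make_active_reader_args(std::uint32_t first_value) {
    std::vector<std::uint32_t> args(kReaderRuntimeArgCount);
    for (std::size_t i = 0; i < args.size(); ++i) {
        args[i] = first_value + static_cast<std::uint32_t>(i);
    }
    return args;
}

// Hypothetical builder for an idle core: the values are dummies, but the
// length matches the active case so every kernel-side read stays in bounds.
std::vector<std::uint32_t> make_idle_reader_args() {
    return std::vector<std::uint32_t>(kReaderRuntimeArgCount, 0u);
}

// Models the bounds check: reads of indices 0..kReaderRuntimeArgCount-1
// are only valid if the host supplied at least that many args.
bool reader_reads_in_bounds(const std::vector<std::uint32_t>& args) {
    return args.size() >= kReaderRuntimeArgCount;
}
```

Under these assumptions, a vector one element short fails `reader_reads_in_bounds`, which is the kind of out-of-bounds access the table row attributes to the watcher.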
Summary
- `generate_reduce_scaler` hardcoded 2048 bytes and 4 faces, assuming full 32x32 bf16 tiles. When circular buffers use half tiles (1024B, 2 faces), this overwrites adjacent L1 memory, causing watcher-detected corruption.
- Restore the `half_tile` template parameter so the zero-fill size and face iteration adapt to the actual tile dimensions. Also fix the idle-core runtime args count mismatch in `sdpa_decode_program_factory`.

Fixes: #37631
Fixes: #29225
Replaces #37833 (closed due to bad rebase)
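
As a rough illustration of the size arithmetic above, here is a host-side sketch of a fill routine parameterized on `half_tile`. It is not the kernel code from `generate_reduce_scaler.hpp`: `fill_reduce_scaler_sketch`, `num_faces`, and `tile_bytes` are hypothetical names, plain `memset` stands in for the device zero-fill, and placing the scaler in the first row of each 16x16 face is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// bf16 elements are 2 bytes; a face is 16x16 elements (512 bytes).
constexpr std::size_t kBytesPerElem = 2;
constexpr std::size_t kFaceRowElems = 16;
constexpr std::size_t kFaceElems = 16 * 16;

// Hypothetical helpers: a full tile has 4 faces, a half tile has 2.
template <bool half_tile>
constexpr std::size_t num_faces() {
    return half_tile ? 2 : 4;
}

template <bool half_tile>
constexpr std::size_t tile_bytes() {
    return num_faces<half_tile>() * kFaceElems * kBytesPerElem;
}

static_assert(tile_bytes<false>() == 2048, "full tile: 4 faces, 2048 bytes");
static_assert(tile_bytes<true>() == 1024, "half tile: 2 faces, 1024 bytes");

// Zero only the bytes the tile owns, then write the scaler into the first
// row of each face (assumed layout, for illustration only).
template <bool half_tile>
void fill_reduce_scaler_sketch(std::uint16_t* tile, std::uint16_t scaler) {
    assert(tile != nullptr);
    std::memset(tile, 0, tile_bytes<half_tile>());
    for (std::size_t face = 0; face < num_faces<half_tile>(); ++face) {
        for (std::size_t col = 0; col < kFaceRowElems; ++col) {
            tile[face * kFaceElems + col] = scaler;
        }
    }
}
```

Calling `fill_reduce_scaler_sketch<true>` on a buffer that holds a half tile followed by other data leaves that other data untouched, whereas the full-tile instantiation would clear 2048 bytes and overwrite it, which matches the corruption described in the first bullet.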
Test plan
🤖 Generated with Claude Code